Abstract. Long-running service workloads (e.g. web search engines) and 
short-term data analysis workloads (e.g. Hadoop MapReduce jobs) co-locate 
in today’s data centers. Developing realistic benchmarks that reflect this 
practical scenario of mixed workloads is key to producing trustworthy 
results when evaluating and comparing data center systems. This requires 
using actual workloads and guaranteeing that their submissions follow the 
patterns hidden in real-world traces. However, existing benchmarks either 
generate actual workloads based on probability models, or replay real-world 
workload traces using basic I/O operations. To fill this gap, we propose a 
benchmark tool that is a first step towards generating a mix of actual 
service and data analysis workloads on the basis of real workload traces. 
Our tool includes a combiner that enables the replaying of actual workloads 
according to the workload traces, and a multi-tenant generator that flexibly 
scales the workloads up and down according to users’ requirements. Based on 
this, our demo illustrates the workload customization and generation process 
using a visual interface. The proposed tool, called BigDataBench-MT, is a 
multi-tenant version of our comprehensive benchmark suite BigDataBench and 
it is publicly available from 
http://prof.ict.ac.cn/BigDataBench/multi-tenancyversion/ 
1 Introduction 

In modern cloud data centers, a large number of tenants are consolidated to share 
a common computing infrastructure and execute a diverse mix of workloads. 
Benchmarking and understanding these workloads is a key problem for system 
designers, programmers and researchers to optimize the performance and energy 
efficiency of data center systems and to promote the development of data center 
technology. This work focuses on two classes of popular data center workloads: 

- Long-running services. These workloads offer online services such as web 
search engines and e-commerce sites to end users and the services usually 
keep running for months and years. The tenants of such workloads are service 
end users. 

- Short-term data analysis jobs. These workloads process input data of many 
scales using relatively short periods (e.g. in Google and Facebook data 
centers, a majority (over 90%) of analytic jobs complete within a few minutes 
[30146]). The tenants of such workloads are job submitters. 

As data analysis systems such as Hadoop and Spark mature, both types 
of workloads widely co-locate in today’s data centers, hence the pressure to 
benchmark and understand these mixed workloads rises. Within this context, 
we believe that it will be of interest to the data management community and a 
large user base to generate realistic workloads such that trustworthy 
benchmarking reflecting practical data center scenarios can be conducted. 
Considering the heterogeneity and dynamic nature of data center workloads and 
their aggregated resource demands and arrival patterns, this requires 
overcoming two major challenges. 

Benchmarking using actual workloads based on real-world workload 
traces. Data analysis jobs usually have various computation semantics 
(i.e. implementation logic or source code) and input data sizes (e.g. ranging 
from KB to PB), and their behaviors also heavily rely on the underlying software 
stacks (such as Hadoop or MPI). Hence it is difficult to emulate the behaviors 
of such highly diverse workloads using only synthetic workloads such as I/O 
operations. On the other hand, generating workloads whose arrival patterns 
follow real-world traces is an equally important aspect of realistic workloads. 
This is because these traces are the most realistic data sources, including both 
explicit and implicit arrival patterns (e.g. sequences of time-stamped requests 
or jobs). 

Benchmarking using scalable workloads with realistic mixes. A good 
benchmark needs to flexibly adjust the scale of workloads to meet the 
requirements of different benchmarking scenarios. Based on our experience, 
obtaining real workload traces is in many cases difficult due to 
confidentiality issues. The limited trace data also restricts the scalability 
of the benchmark. It is therefore challenging to produce workloads at different 
scales while still guaranteeing that their mix corresponds to real-world 
scenarios. 

In this paper, we propose a benchmark tool that is a first step towards 
generating realistic mixed data center workloads. This tool, called 
BigDataBench-MT, is a multi-tenancy version of our open-source project 
BigDataBench, which is a comprehensive benchmark suite including 14 real-world 
data sets and 33 actual workloads covering five application domains. The goal 
of BigDataBench-MT is not only to support the generation of service and data 
analysis workloads based on real workload traces, but also to provide a 
multi-tenant framework that enables scaling such workloads up and down while 
guaranteeing their realistic mixes. Because our community may be interested in 
using these workloads to evaluate new system designs and implementations, our 
tool and the corresponding workload traces are publicly available from 
http://prof.ict.ac.cn/BigDataBench/multi-tenancyversion/. 


2 Related Work 

We now review existing data center benchmarks from three perspectives, as 
shown in Table 1. 

Evaluated platform. First of all, we classify data center benchmarks 
according to their targeted systems. We consider three popular camps of systems 
in today’s data centers: (1) Hadoop-related systems: the great prosperity of 
Hadoop-centric systems in industry has brought a wide diversity of systems 
(e.g. Spark [S], HBase [T], Hive [2] and Impala [5]) on top of Hadoop MapReduce 
and HDFS, as well as a wide range of benchmarks specifically designed for these 
systems. (2) Data stores: parallel DBMSs (e.g. MySQL [T2] and Oracle [T5]) and 
NoSQL data stores (e.g. Amazon Dynamo [33], Cassandra [41] and LinkedIn 
Voldemort [50]) also widely exist in data centers. (3) Web services: 
long-running web services such as the Nutch search engine and multi-tier cloud 
applications [36] are another important type of data center application. These 
services usually have stringent response time requirements, and their request 
processing is distributed across a large number of service components for 
parallel processing; thus the service latency is determined by the tail latency 
of these components. 

Workload implementation logic. Considering the complexity and diversity 
of workload behaviors in current data center systems, the implementation logic 
of existing data center benchmarks can be classified into three categories. The 
first category of benchmarks implement their workloads with algorithms. For 
example, HiBench [39] includes workloads implemented with machine learning 
algorithms in Mahout [1]. The second category of benchmarks implement workloads 
using database operations such as reading, loading, joining, grouping, unifying, 
ordering, aggregating and splitting data. The third category of benchmarks 
implement workloads as I/O operations. For example, NNBench [13] and TestDFSIO 
[23] emulate I/O operations on Hadoop HDFS; GridMix [TO] provides two 
workloads: LoadJob, which performs I/O operations, and SleepJob, which sleeps 
the jobs; and SWIM [50] provides four workloads that simulate the operations of 
Hadoop jobs to read, write, shuffle and sort data. We view the first two 
categories of workloads as actual workloads, because these workloads have 
semantics and they consume processor, memory, cache and I/O bandwidth resources 
during execution. By contrast, workloads belonging to the third category only 
consume I/O resources. 

Workload mix. Finally, we classify data center benchmarks into three 
categories from the perspective of workload mix. The first category of 
benchmarks either generate single workloads (e.g. WordCount [55], Grep [3] 
and Sort [IS]) or generate multiple workloads individually (e.g. CALDA, 
AMPLab benchmark [7] and CloudSuite [SI]). That is, these benchmarks do not 
consider workload mix. The second category of benchmarks generate synthetic 
mixes of workloads. Many benchmarks (e.g. PigMix, HcBench and BigBench [35]) 
generate mixes of workloads by manually determining their proportions. 
Similarly, TPC benchmarks [24] design a query set as a synthetic mix of 
queries with different proportions. YCSB uses a package to include a set of 
related workloads. MRBS decides the frequencies of different workloads using 
probability distributions such as a random distribution. Finally, the third 
category of benchmarks generate a realistic mix of synthetic workloads whose 
arrival patterns faithfully follow real-world traces. For example, GridMix and 
SWIM [21,30] first build a job trace to describe the realistic job mix by 
mining production loads, and then run synthetic I/O operations according to the 
trace. However, how to generate actual workloads on the basis of real workload 
traces is still an open question. 

Table 1. Overview of data center benchmarks 

Rows give the workload mix; the entries under each row give the workload 
implementation logic. 

Workload mix: No mix 
  Algorithms: WordCount [26], Grep [5], Sort, Terasort [22], HiBench [31], 
    TPCx-HS, Graphalytics [H], CloudSuite [34] 
  Database operations: MRBench [40], CALDA, AMPLab benchmark [7], YCSB, 
    BG benchmark, CloudSuite [31] 
  I/O operations: NNBench, TestDFSIO [13], HiBD [49] 

Workload mix: Synthetic mix 
  Algorithms: HcBench [47], MRBS 
  Database operations: PigMix [17], HcBench [47], MRBS [48], BigBench [35], 
    LinkBench [27], TPC benchmarks [24], TPC-W [25], BigDataBench [5T] 
  I/O operations: HiBench [39], SPECWeb99 [20] 

Workload mix: Realistic mix 
  Algorithms: (none) 
  Database operations: (none) 
  I/O operations: GridMix, SWIM [21,30] 

Evaluated platform classes: (1) Hadoop-related systems, (2) data stores, 
(3) web services, (4) all three types of systems. 

3 System Overview 

Figure 1 shows the framework of our benchmark tool. It consists of three main 
modules. In the Benchmark User Portal, users first specify their benchmarking 
requirements, including the type and number of machines to be tested and the 
types of workload to use. A set of workload traces matching these requirements 
is then selected. Next, the Combiner of Workloads and Traces matches the actual 
workloads with the selected workload traces and outputs workload replaying 
scripts to guide the workload generation. Finally, the Multi-tenant Workload 
Generator extracts the tenant information from the scripts and constructs a 
multi-tenant framework to generate a mix of service and data analysis 
workloads. 

In BigDataBench-MT, we employ the Sogou user query logs as the basis to 
generate the service workload (i.e. the Nutch search engine) and the Google 
cluster workload trace as the basis to generate data analysis workloads 
(i.e. Hadoop and Shark workloads). The Sogou trace records logs from 50 days 
and it includes over 9 million users and 43 million queries. The Google trace 
records logs from 29 days and 12,492 machines and it includes over 5K users, 
40K workload types, 1000K jobs and 144 million tasks. As a preprocessing step, 
we converted both traces into Impala databases (full version) and MySQL 
databases (24-hour version) to facilitate the customization of benchmarking 
scenarios. In the following subsections, we describe the last two modules of 
our tool. 





Fig. 1. The BigDataBench-MT framework 


3.1 Combiner of Workload and Traces 

The goal of the combiner is to extract the request/job arrival patterns from 
real-world traces and combine them with actual workloads. The combiner applies 
differentiated combination techniques to the service and data analysis workloads 
because their workload generations have different features. 

Service workloads. The generation of a service workload is determined 
by three factors: the request submitting time, the sequence of requests and the 
content of each request query. Taking the web search engine as an example, the 
combiner implements a request submitting service that automatically derives 
these factors from the Sogou trace and uses them to determine the request 
submission process. 
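
The derivation of these three factors can be illustrated with a minimal sketch. It assumes a hypothetical tab-separated log format (`HH:MM:SS`, user id, query text), which is not the actual Sogou log layout, and the function name `build_request_schedule` is ours for illustration only.

```python
from datetime import datetime

def build_request_schedule(log_lines):
    """Turn raw log lines into (offset_seconds, user_id, query) tuples
    ordered by submit time. The log format is a hypothetical one."""
    records = []
    for line in log_lines:
        timestamp, user_id, query = line.rstrip("\n").split("\t", 2)
        records.append((datetime.strptime(timestamp, "%H:%M:%S"), user_id, query))
    # A stable sort keeps the original order of requests with equal timestamps,
    # so the request sequence recorded in the trace is preserved.
    records.sort(key=lambda r: r[0])
    if not records:
        return []
    start = records[0][0]
    return [(int((t - start).total_seconds()), u, q) for t, u, q in records]

# Illustrative log entries (made up for this sketch).
sample_log = [
    "12:00:05\tu2\tweather beijing",
    "12:00:00\tu1\thadoop tutorial",
    "12:00:05\tu1\tspark vs hadoop",
]
schedule = build_request_schedule(sample_log)
```

Each tuple carries the submit offset, the position in the sequence (its index) and the query content, which is the information a replay client would need.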

Data analysis workloads. The generation of a data analysis workload is 
determined by four factors: the job submitting time, the workload type (i.e. the 
computation semantics and the software stack) and the input data (i.e. the data 
source and size). Current workload traces usually record job submitting times 
but only provide anonymous jobs whose workload types and/or input data are 
unknown. Hence the basic idea of the combiner is to derive the workload 
characteristics of both actual and anonymous jobs and then match jobs whose 
workload characteristics are sufficiently similar. Table 2 lists the metrics 
used to represent workload characteristics, which reflect both the jobs’ 
performance (execution time and resource usage) and their micro-architectural 
behaviors (CPI and MAI). 


Table 2. Metrics to represent workload characteristics of data analysis jobs 

Metric               Description 
Execution time       Measured in seconds 
CPU usage            Total CPU time per second 
Total memory size    Measured in GB 
CPI                  Cycles per instruction 
MAI                  The number of memory accesses per instruction 


Figure 2 shows the process of matching actual data analysis jobs with the 
traces’ anonymous jobs; it consists of two parallel sub-processes. First, the 
actual jobs are tested with different input data sizes and their workload 
characteristic metrics are collected. In BigDataBench-MT, we provide 
auto-running scripts to collect performance metrics, and use hardware 
performance counter tools (Perf [16] and Oprofile for Linux 2.6+ based systems) 
to obtain micro-architectural metrics. Using the testing results as samples, 
the combiner trains a multivariate regression model to describe the 
relationship between an actual job (with its workload type and input size as 
the independent variables) and its workload characteristic metrics (each metric 
is a dependent variable). Second, the combiner views each anonymous job as an 
entity and the five workload characteristic metrics as its attributes, and 
employs the Bayesian Information Criterion (BIC)-based k-means clustering 
algorithm to group the anonymous jobs in the trace into different clusters. 
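
To make the first sub-process concrete, the sketch below fits one metric (execution time) against input size. It is a simplification under our own assumptions: instead of a single multivariate model with the workload type as an independent variable, it fits a separate least-squares line per workload type, and all sample values are made up for illustration.

```python
def fit_line(sizes, values):
    """Ordinary least squares for value = intercept + slope * size."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, values))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def train_models(samples):
    """samples: list of (workload_type, input_size_gb, metric_value).
    Returns {workload_type: (intercept, slope)}."""
    grouped = {}
    for wtype, size, value in samples:
        sizes, values = grouped.setdefault(wtype, ([], []))
        sizes.append(size)
        values.append(value)
    return {w: fit_line(xs, ys) for w, (xs, ys) in grouped.items()}

def predict(models, wtype, size):
    """Estimate the metric of an actual job with the given type and input size."""
    intercept, slope = models[wtype]
    return intercept + slope * size

# Hypothetical test results: (workload type, input size in GB, execution time in s).
samples = [
    ("Sort", 1, 12.0), ("Sort", 2, 22.0), ("Sort", 4, 42.0),
    ("WordCount", 1, 8.0), ("WordCount", 2, 13.0), ("WordCount", 4, 23.0),
]
models = train_models(samples)
```

With models like these, the metrics of an actual job at an untested input size can be estimated rather than measured, which is what the matching step needs when it iterates over input sizes.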

Based on the constructed regression models and clusters, the combiner 
further matches each cluster to one actual job with a specific input data size. 
In the matching, the coefficient of variation (CV), defined as the ratio of the 
standard deviation σ to the mean μ, is used to describe the dispersion of jobs 
in the same cluster. The combiner iteratively tests actual jobs of different 
workload types and input sizes, and matches an actual job with a cluster under 
two conditions: (i) the CV of the cluster is smaller than a specified threshold 
(e.g. 0.5), which indicates that the anonymous jobs in this cluster are closely 
similar to each other; (ii) the change in this CV is smaller than a threshold 
(e.g. 0.1) after the actual job is added to the cluster. This means the 
workload characteristics of the added job are sufficiently similar to those of 
the anonymous jobs in the cluster. If multiple matched actual jobs are found 
for one cluster, the combiner selects the job resulting in the smallest CV 
change. Finally, the combiner produces workload replaying scripts as the 
output. 
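
The two matching conditions can be sketched as follows. The text does not state how the CV is aggregated over the five metrics, so this sketch assumes the per-metric CVs are averaged; the function names, job labels and metric values are hypothetical.

```python
from statistics import mean, pstdev

def cluster_cv(jobs):
    """Average coefficient of variation (std / mean) over all metrics.
    Averaging across metrics is an assumption of this sketch."""
    n_metrics = len(jobs[0])
    cvs = []
    for i in range(n_metrics):
        column = [job[i] for job in jobs]
        cvs.append(pstdev(column) / mean(column))
    return mean(cvs)

def match_cluster(cluster, candidates, cv_limit=0.5, change_limit=0.1):
    """Return the candidate actual job that best matches the cluster, or None."""
    base_cv = cluster_cv(cluster)
    if base_cv >= cv_limit:          # condition (i): cluster must be tight
        return None
    best_name, best_change = None, None
    for name, metrics in candidates.items():
        change = abs(cluster_cv(cluster + [metrics]) - base_cv)
        # condition (ii): adding the actual job must barely move the CV;
        # among all jobs that qualify, keep the one with the smallest change.
        if change < change_limit and (best_change is None or change < best_change):
            best_name, best_change = name, change
    return best_name

# Metrics per job: (execution time, CPU usage, memory size, CPI, MAI); values are made up.
cluster = [
    (100, 0.50, 2.0, 1.2, 0.30),
    (110, 0.60, 2.2, 1.1, 0.32),
    (90, 0.55, 1.8, 1.3, 0.28),
]
candidates = {
    "Sort-1GB": (105, 0.55, 2.0, 1.2, 0.30),
    "Bayes-2GB": (400, 0.90, 8.0, 0.8, 0.10),
}
best = match_cluster(cluster, candidates)
```

In this example the candidate whose metrics lie close to the cluster is selected, while the candidate with a much longer execution time shifts the CV by more than the threshold and is rejected.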


[Figure 2 panels: actual data analytic workloads; regression models; clusters 
of anonymous jobs; workload replaying scripts with the columns User id, 
Workload type, Input size and Submit time] 

Fig. 2. The matching process of real and synthetic data analysis jobs 


Note that BigDataBench-MT provides two ways of using the above combiner. 
First, it directly provides some workload replaying scripts, which are the 
combination results of representative actual workloads (e.g. Hadoop Sort and 
WordCount) and the Google workload trace. Second, it also lets benchmark 
users apply the above combination technique themselves to match their own 
data analysis jobs with Google anonymous jobs. 


3.2 Multi-tenant Workload Generator 

Based on workload replaying scripts, the workload generator applies a 
multi-tenant mechanism to generate a mix of workloads in two steps. First, the 
generator extracts the tenant information from the scripts. For the service and 
data analysis workloads, this tenant information represents the number of 
concurrent end users and submitters of analytic jobs, respectively. Second, the 
generator creates a client for each tenant and emulates the scenario in which a 
number of end users/job submitters concurrently submit requests/jobs to the 
system. This multi-tenant framework allows the flexible adjustment of workload 
scales while guaranteeing their realistic mixes. For example, benchmark users 
can double or halve the number of concurrent tenants, after which the 
distributions of requests/jobs submitted by these tenants still correspond to 
those in real workload traces. 
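
One way such tenant-level scaling could work is sketched below. This is our own illustrative assumption rather than the tool's documented mechanism: tenants are cloned (scaling up) or sampled (scaling down) as whole units, so every tenant in the scaled script replays an unmodified per-tenant submission sequence from the trace. The script layout and the name `scale_tenants` are hypothetical.

```python
import random

def scale_tenants(script, factor, seed=0):
    """script: {tenant_id: [(submit_time, workload), ...]}.
    Returns a script with about len(script) * factor tenants."""
    rng = random.Random(seed)
    tenant_ids = sorted(script)
    target = max(1, round(len(tenant_ids) * factor))
    full_copies, remainder = divmod(target, len(tenant_ids))
    scaled = {}
    # Whole copies of the tenant population come first...
    for copy in range(full_copies):
        for tid in tenant_ids:
            scaled[f"{tid}#{copy}"] = list(script[tid])
    # ...then a random sample of distinct tenants fills the remainder.
    for tid in rng.sample(tenant_ids, remainder):
        scaled[f"{tid}#{full_copies}"] = list(script[tid])
    return scaled

# Illustrative replaying script with four tenants (values are made up).
script = {
    "001": [("00:10", "Sort"), ("58:20", "Bayes")],
    "002": [("03:15", "WordCount")],
    "003": [("10:00", "Sort")],
    "300": [("59:05", "K-means")],
}
doubled = scale_tenants(script, 2)
halved = scale_tenants(script, 0.5)
```

Because tenants are never split, the sequence of jobs or requests issued by any single client is always one that occurred in the original trace.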


















4 Demonstration Description 

4.1 Chosen Workloads and Workload Traces 

In our demonstration, benchmark users want to evaluate their data center 
systems using a mix of service and data analysis workloads. The Nutch web 
search engine [5] is used as the example service workload and four Hadoop 
workloads are used as the example data analysis workloads. The chosen Hadoop 
workloads have a variety of workload characteristics: WordCount and Naive Bayes 
classification are typical CPU-intensive workloads with integer and 
floating-point calculations; Sort is a typical I/O-intensive workload; and 
Pageindex has similar demands for CPU and I/O resources. Both the data 
generators [42] and the workloads in the demo can be obtained from 
BigDataBench. 

Our demo uses 24-hour user query logs from Sogou, which include 1,724,264 
queries from 519,876 end users, as the basis to generate a realistic search 
engine service; and uses a 24-hour cluster workload trace from Google, which 
includes 37,842 anonymous jobs from 2,261 job submitters, as the basis to 
generate realistic Hadoop jobs. 


4.2 System Demonstration 

BigDataBench-MT provides a visual interface in the Benchmark User Portal 
to help benchmark users make appropriate benchmarking decisions. This portal 
provides users with the necessary information, allows them to input 
benchmarking requirements and executes system evaluations on their behalf. The 
whole process consists of three steps, as shown in Figures 3, 4 and 5, 
respectively. 

Step 1. Specification of tested machines and workloads. The first step of the 
demo presents an overview of the workload traces (i.e. the Sogou and Google 
traces) and the data center status, including the six types of machines, their 
numbers and configurations, and the user, job and task statistics on these 
machines. This information helps benchmark users select the type and number of 
machines to be evaluated, and the workloads they want to use. Suppose users 
select Type Four machines, which have 2 processor cores and 4GB memory, and 
100 machines to be tested; the workload traces belonging to these machines are 
then extracted and forwarded to the next step. 

Step 2. Selection of benchmarking period and scale. At this step, users have 
the option to select the period and scale for their benchmarking scenarios. To 
facilitate this selection, BigDataBench-MT shows the statistics of both the 
service workload (including its number of requests and end users per second) 
and the data analysis workloads (including their number of jobs and average 
CPU and memory usage) for each of the 24 hours. Suppose users select the 
benchmarking period of 12:00 to 13:00 and a scale factor of 1 (that is, no 
scaling is needed); the workload traces belonging to this period are then 
selected for step 3. 
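
The per-hour statistics and the period selection described in this step can be sketched as below. The record layout (`HH:MM:SS` timestamp plus user id) and both function names are assumptions made for illustration; they do not reflect the tool's database schema.

```python
from collections import defaultdict

def hourly_stats(trace):
    """trace: iterable of ('HH:MM:SS', user_id).
    Returns {hour: (number_of_requests, number_of_distinct_users)}."""
    requests = defaultdict(int)
    users = defaultdict(set)
    for timestamp, user_id in trace:
        hour = int(timestamp[:2])
        requests[hour] += 1
        users[hour].add(user_id)
    return {h: (requests[h], len(users[h])) for h in sorted(requests)}

def select_period(trace, start_hour, end_hour):
    """Keep the records submitted in the half-open period [start_hour, end_hour)."""
    return [rec for rec in trace if start_hour <= int(rec[0][:2]) < end_hour]

# Illustrative trace records (made up for this sketch).
trace = [
    ("11:59:59", "u1"),
    ("12:00:00", "u1"),
    ("12:30:10", "u2"),
    ("12:59:59", "u1"),
    ("13:00:00", "u3"),
]
stats = hourly_stats(trace)
selected = select_period(trace, 12, 13)
```

Here `select_period(trace, 12, 13)` corresponds to choosing the 12:00 to 13:00 benchmarking period with a scale factor of 1.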

Fig. 3. System Demonstration Screenshots: Step 1 

Fig. 4. System Demonstration Screenshots: Step 2 

Step 3. Generation of mixed workloads. After both workloads and traces 
have been selected, the final step employs the combiner described in Section 
3.1 to generate workload replaying scripts for both the service and data 
analysis workloads, and sends these scripts as feedback to users. In the 
matching of actual Hadoop jobs with anonymous ones, we tested each Hadoop 
workload type using 20 different input sizes to build the regression models. 
Based on the replaying scripts, benchmark users can press the “Generate Mixed 
Workload” button to trigger the multi-tenant workload generator, in which each 
tenant is an independent workload generator and multiple tenants generate a 
mix of realistic workloads. 


[Screenshot: the “Multi-tenancy version of BigDataBench” page with a “Generate 
Mixed Workload” button and two panels, “Replaying Scripts of Service Workload” 
and “Replaying Scripts of Data Analytic Workload”] 



Fig. 5. System Demonstration Screenshots: Step 3 


5 Future Work 

There are multiple avenues for extending the functionality of our benchmark 
tool. A first step will be to support more actual workloads. Given that there 
are 33 actual workloads in BigDataBench and many workloads (e.g. Bayes 
classification and WordCount) have three implementations (Hadoop, Spark and 
MPI), adding more workloads to BigDataBench-MT will help support a wider range 
of benchmarking scenarios. We also plan to extend our multi-tenant workload 
generator to support different classes of tenants and allow users to apply 
different priority disciplines in workload generation. 